Search | Virtual Health Library (Biblioteca Virtual em Saúde)

An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation.

Wang, Peng; Li, Yong; Yang, Liang; Li, Simin; Li, Linfeng; Zhao, Zehan; Long, Shaopei; Wang, Fei; Wang, Hongqian; Li, Ying; Wang, Chengliang.

JMIR Med Inform ; 10(8): e38154, 2022 Aug 30.

Article in English | MEDLINE | ID: mdl-36040774

ABSTRACT

BACKGROUND: With the popularization of electronic health records in China, digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information, and using it directly may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Rule-based, machine learning-based, and deep learning-based methods have been proposed to solve this problem. However, these methods still face two difficulties: insufficient Chinese electronic health record data and the complex features of the Chinese language.
OBJECTIVE: This paper proposes a method to overcome the difficulties of overfitting and a lack of training data for deep neural networks to enable Chinese protected health information deidentification.
METHODS: We propose a new model that combines TinyBERT (bidirectional encoder representations from transformers) as a text feature extraction module with the conditional random field method as a prediction module for deidentifying protected health information in Chinese electronic health records. In addition, a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy is proposed to compensate for the shortage of Chinese electronic health records.
RESULTS: We compare our method with 5 baseline methods that use different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method had better performance (micro-precision: 98.7%, micro-recall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods.
CONCLUSIONS: Compared with the baseline methods, our method retained the efficiency advantage of TinyBERT on the proposed augmented data set while improving performance on the task of Chinese protected health information deidentification.
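
The mention-replacement strategy named in METHODS can be illustrated with a small sketch. The abstract does not describe the authors' implementation, so everything below is an assumption made for illustration: character-level tokens, BIO tags, the helper names extract_mentions and replace_mentions, and the hypothetical replacement lexicon. The idea shown is only the general one: swap each tagged mention for another mention of the same type and regenerate the tags so that the labels stay aligned.

```python
import random


def extract_mentions(tokens, tags):
    """Return (start, end, label) spans from BIO tags; end is exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def replace_mentions(tokens, tags, lexicon, rng):
    """Swap every mention for a same-type entry drawn from a lexicon.

    `lexicon` maps an entity label to candidate replacement strings
    (hypothetical data, not from the paper). Mentions whose label has
    no candidates are copied unchanged.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, label in extract_mentions(tokens, tags):
        # Copy the untouched context before this mention.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        candidates = lexicon.get(label)
        mention = list(rng.choice(candidates)) if candidates else tokens[start:end]
        # Re-tag the replacement so tokens and tags stay the same length.
        new_tokens.extend(mention)
        new_tags.extend(["B-" + label] + ["I-" + label] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


if __name__ == "__main__":
    tokens = list("患者王伟入院")  # invented sentence, one character per token
    tags = ["O", "O", "B-NAME", "I-NAME", "O", "O"]
    lexicon = {"NAME": ["李娜", "欧阳明月"]}  # invented names
    new_tokens, new_tags = replace_mentions(tokens, tags, lexicon, random.Random(0))
    print("".join(new_tokens), new_tags)
```

Because the replacement may be longer or shorter than the original mention, the tags are rebuilt rather than copied; this keeps the augmented sentence usable as training data for a sequence labeler such as a CRF layer.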

A Bi-level representation learning model for medical visual question answering.

Li, Yong; Long, Shaopei; Yang, Zhenguo; Weng, Heng; Zeng, Kun; Huang, Zhenhua; Lee Wang, Fu; Hao, Tianyong.

J Biomed Inform ; 134: 104183, 2022 Oct.

Article in English | MEDLINE | ID: mdl-36038063

ABSTRACT

Medical Visual Question Answering (VQA) aims to answer questions about given medical images and has tremendous potential in healthcare services. However, research on medical VQA still faces challenges, particularly in learning a fine-grained multimodal semantic representation from a relatively small volume of data for answer prediction. Moreover, the long-tailed label distribution of medical VQA data frequently results in poor model performance. To this end, we propose a novel bi-level representation learning model with two reasoning modules to learn bi-level representations for the medical VQA task. One is sentence-level reasoning, which learns sentence-level semantic representations from multimodal input. The other is token-level reasoning, which employs an attention mechanism to generate a multimodal contextual vector by fusing image features and word embeddings. The contextual vector is used to filter irrelevant semantic representations from sentence-level reasoning to generate a fine-grained multimodal representation. Furthermore, a label-distribution-smooth margin loss is proposed to minimize the generalization error bound on long-tailed datasets by modifying the margin bound of different labels in the training set. On the standard VQA-Rad and PathVQA datasets, the proposed model achieves accuracies of 0.7605 and 0.5434 and F1-scores of 0.7741 and 0.5288, respectively, outperforming a set of state-of-the-art baseline models.
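
The abstract does not give the formula for the label-distribution-smooth margin loss, so the following sketch does not reproduce it. It only illustrates the general idea of a label-dependent margin for long-tailed data, using a swapped-in, commonly used rule in which the margin of a class shrinks with its training count as n^(-1/4). The function names, the max_margin parameter, and the example counts are all assumptions made for illustration.

```python
import math


def class_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n ** -0.25 (assumed rule).

    The rarest class receives `max_margin`; frequent classes get less.
    All counts must be positive.
    """
    raw = [n ** -0.25 for n in class_counts]
    scale = max_margin / max(raw)
    return [scale * r for r in raw]


def margin_cross_entropy(logits, target, margins):
    """Cross-entropy after subtracting the target class's margin.

    Lowering the target logit forces the model to win by at least the
    margin, which is stricter for rare labels.
    """
    adjusted = list(logits)
    adjusted[target] -= margins[target]
    peak = max(adjusted)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in adjusted))
    return log_sum_exp - adjusted[target]


if __name__ == "__main__":
    counts = [1000, 10]  # hypothetical head and tail label frequencies
    margins = class_margins(counts)
    logits = [1.0, 1.2]
    print("margins:", margins)
    print("loss without margin:", margin_cross_entropy(logits, 1, [0.0, 0.0]))
    print("loss with margin:   ", margin_cross_entropy(logits, 1, margins))
```

With the same logits, the margin version reports a larger loss for the rare label, so training keeps pushing its decision boundary further out than it would for a frequent label.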

Subjects

Machine Learning, Semantics, Delivery of Health Care, Language, Learning
